Home Depot Product Search Relevance

The challenge is to predict a relevance score for the provided combinations of search terms and products. To create the ground truth labels, Home Depot has crowdsourced the search/product pairs to multiple human raters.

GraphLab Create

This notebook uses the GraphLab Create machine learning library from an IPython notebook. You need a personal license to run this code.


In [1]:
import graphlab as gl
from nltk.stem import *

Load data from CSV files


In [2]:
train = gl.SFrame.read_csv("../data/train.csv")


[INFO] This non-commercial license of GraphLab Create is assigned to thomasv1000@hotmail.fr and will expire on October 12, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-34069 - Server binary: /Users/tjaskula/.graphlab/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1455056183.log
[INFO] GraphLab Server Version: 1.8.1
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/train.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.123565 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/train.csv
PROGRESS: Parsing completed. Parsed 74067 lines in 0.1662 secs.

In [3]:
test = gl.SFrame.read_csv("../data/test.csv")


PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.210436 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
PROGRESS: Parsing completed. Parsed 166693 lines in 0.321425 secs.

In [4]:
desc = gl.SFrame.read_csv("../data/product_descriptions.csv")


PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.512102 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Read 61134 lines. Lines per second: 61129.8
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
PROGRESS: Parsing completed. Parsed 124428 lines in 1.5747 secs.

Data merging


In [5]:
# merge train with description
train = train.join(desc, on = 'product_uid', how = 'left')

In [6]:
# merge test with description
test = test.join(desc, on = 'product_uid', how = 'left')
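The two left joins above attach each product's description to every train and test row via `product_uid`. As a plain-Python sketch of what a left join does (toy data, not GraphLab's `SFrame.join`):

```python
# Toy illustration of a left join on 'product_uid' (stand-in data, not the real CSVs).
desc_by_uid = {100001: "angle description", 100002: "deck description"}

rows = [
    {"product_uid": 100001, "search_term": "angle bracket"},
    {"product_uid": 999999, "search_term": "unknown product"},
]

# Left join: every left-hand row is kept; rows with no match get None.
joined = [dict(r, product_description=desc_by_uid.get(r["product_uid"]))
          for r in rows]
```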

Let's explore some data

Let's examine three different query/product pairs:

  • the first in the training set
  • one from the middle of the training set
  • the last in the training set

In [7]:
first_doc = train[0]
first_doc


Out[7]:
{'id': 2,
 'product_description': 'Not only do angles make joints stronger, they also provide more consistent, straight corners. Simpson Strong-Tie offers a wide variety of angles in various sizes and thicknesses to handle light-duty jobs or projects where a structural connection is needed. Some can be bent (skewed) to match the project. For outdoor projects or those where moisture is present, use our ZMAX zinc-coated connectors, which provide extra resistance against corrosion (look for a "Z" at the end of the model number).Versatile connector for various 90 connections and home repair projectsStronger than angled nailing or screw fastening aloneHelp ensure joints are consistently straight and strongDimensions: 3 in. x 3 in. x 1-1/2 in.Made from 12-Gauge steelGalvanized for extra corrosion resistanceInstall with 10d common nails or #9 x 1-1/2 in. Strong-Drive SD screws',
 'product_title': 'Simpson Strong-Tie 12-Gauge Angle',
 'product_uid': 100001,
 'relevance': 3.0,
 'search_term': 'angle bracket'}

The search term 'angle bracket' is not contained in the description. After stemming, 'angle' would match (via 'angles'), but 'bracket' still would not.


In [8]:
middle_doc = train[37033]
middle_doc


Out[8]:
{'id': 113228,
 'product_description': 'PureBond Plywood Project Panels are a convenient and cost-effective way to build cabinets, furniture and other woodworking projects. It provides a beautiful wood veneer face bonded to a strong and flat wood core. These PureBond Project Panels are made with no added formaldehyde, eliminating the concern about off-gassing dangerous fumes during fabrication or when installed in your home. Their smaller size makes them easy to handle and allows you to order just the amount of wood you need. PureBond plywood, in Project Panels sizes or in full sheet sizes, are a Home Depot exclusive.California residents: see Proposition 65 informationDecorative mahogany veneer applied to both sides of this panelB-2 plain sliced mahogany - 7-ply constructionLight weight, all-wood veneer constructionPrecision-cut hardwood plywood panels in convenient small sizesCommon: 3/4 in. x 2 ft. x 4 ft.; Actual: 0.703 in. x 24 in. x 48 in.Grade: B-2',
 'product_title': '3/4 in. x 2 ft. x 4 ft. PureBond Mahogany Plywood Project Panel',
 'product_uid': 137334,
 'relevance': 3.0,
 'search_term': 'table top wood'}

Only 'wood' from the search term is present.


In [9]:
last_doc = train[-1]
last_doc


Out[9]:
{'id': 221473,
 'product_description': 'No. 918 Millennial Ryan heathered texture semi-sheer curtain is a casual solid that adds freshness and a finishing touch to any decor setting. Enhances privacy while allowing light to gently filter through. Clean, simple one-pocket pole top design can be used with a standard or decorative curtain rod. Mix and match with other solids and prints for a look that is all your own.Sheer panel, gently filters lightNo header pole top panelMachine washableWide array of colors to choose from100% polyesterContains 1-curtain panel',
 'product_title': 'LICHTENBERG Pool Blue No. 918 Millennial Ryan Heathered Texture Sheer Curtain Panel, 40 in. W x 63 in. L',
 'product_uid': 206650,
 'relevance': 2.33,
 'search_term': 'fine sheer curtain 63 inches'}

Only 'sheer' and 'curtain' from the search term are present.

How many search terms are missing from the description and title of relevance-3 documents?

Relevance-3 documents are the most relevant matches, but how many of these queries have no search terms in the product description or title?


In [10]:
train['search_term_word_count'] = gl.text_analytics.count_words(train['search_term'])
ranked3doc = train[train['relevance'] == 3]
print ranked3doc.head()
len(ranked3doc)


+-----+-------------+-------------------------------+
|  id | product_uid |         product_title         |
+-----+-------------+-------------------------------+
|  2  |    100001   | Simpson Strong-Tie 12-Gaug... |
|  9  |    100002   | BEHR Premium Textured Deck... |
|  18 |    100006   | Whirlpool 1.9 cu. ft. Over... |
|  21 |    100006   | Whirlpool 1.9 cu. ft. Over... |
|  27 |    100009   | House of Fara 3/4 in. x 3 ... |
|  35 |    100011   | Toro Personal Pace Recycle... |
|  37 |    100011   | Toro Personal Pace Recycle... |
|  65 |    100016   | Sunjoy Calais 8 ft. x 5 ft... |
| 123 |    100023   | Quikrete 80 lb. Crack-Resi... |
| 162 |    100029   | DecoArt Americana Decor 16... |
+-----+-------------+-------------------------------+
+--------------------------------+-----------+-------------------------------+
|          search_term           | relevance |      product_description      |
+--------------------------------+-----------+-------------------------------+
|         angle bracket          |    3.0    | Not only do angles make jo... |
|           deck over            |    3.0    | BEHR Premium Textured DECK... |
|         convection otr         |    3.0    | Achieving delicious result... |
|           microwaves           |    3.0    | Achieving delicious result... |
|            mdf 3/4             |    3.0    | Get the House of Fara 3/4 ... |
| briggs and stratton lawn mower |    3.0    | Recycler 22 in. Personal P... |
|            gas mowe            |    3.0    | Recycler 22 in. Personal P... |
|          grill gazebo          |    3.0    | Make grilling great with t... |
| CONCRETE & MASONRY CLEANER...  |    3.0    | Quikrete 80 lb. Crack-Resi... |
|          chalk paint           |    3.0    | Achieving a vintage, time-... |
+--------------------------------+-----------+-------------------------------+
+-------------------------------+
|     search_term_word_count    |
+-------------------------------+
|   {'bracket': 1, 'angle': 1}  |
|     {'over': 1, 'deck': 1}    |
|  {'otr': 1, 'convection': 1}  |
|       {'microwaves': 1}       |
|      {'mdf': 1, '3/4': 1}     |
| {'and': 1, 'stratton': 1, ... |
|     {'gas': 1, 'mowe': 1}     |
|   {'grill': 1, 'gazebo': 1}   |
| {'etcher': 1, 'cleaner': 1... |
|    {'chalk': 1, 'paint': 1}   |
+-------------------------------+
[10 rows x 7 columns]

Out[10]:
19125

In [11]:
words_search = gl.text_analytics.tokenize(ranked3doc['search_term'], to_lower = True)
words_description = gl.text_analytics.tokenize(ranked3doc['product_description'], to_lower = True)
words_title = gl.text_analytics.tokenize(ranked3doc['product_title'], to_lower = True)
wordsdiff_desc = []
wordsdiff_title = []
puid = []
search_term = []
ws_count = []
ws_count_used_desc = []
ws_count_used_title = []
for item in xrange(len(ranked3doc)):
    ws = words_search[item]
    pd = words_description[item]
    pt = words_title[item]
    # a set difference is never None; an empty set simply means every
    # query token appears in the document
    diff = set(ws) - set(pd)
    wordsdiff_desc.append(diff)
    
    diff2 = set(ws) - set(pt)
    wordsdiff_title.append(diff2)
    
    puid.append(ranked3doc[item]['product_uid'])
    search_term.append(ranked3doc[item]['search_term'])
    ws_count.append(len(ws))
    ws_count_used_desc.append(len(ws) - len(diff))
    ws_count_used_title.append(len(ws) - len(diff2))
    
differences = gl.SFrame({"puid" : puid,
                         "search term": search_term,
                         "diff desc" : wordsdiff_desc,
                         "diff title" : wordsdiff_title,
                         "ws count" : ws_count, 
                         "ws count used desc" : ws_count_used_desc,
                         "ws count used title" : ws_count_used_title})

In [12]:
differences.sort(['ws count used desc', 'ws count used title'])


Out[12]:
+---------------------------+---------------------------+--------+-----------------------+----------+--------------------+---------------------+
|         diff desc         |         diff title        |  puid  |      search term      | ws count | ws count used desc | ws count used title |
+---------------------------+---------------------------+--------+-----------------------+----------+--------------------+---------------------+
| [recycling, bins]         | [recycling, bins]         | 145727 | recycling bins        | 2        | 0                  | 0                   |
| [over, deck]              | [over, deck]              | 100002 | deck over             | 2        | 0                  | 0                   |
| [hammer, electric, drill] | [hammer, electric, drill] | 120061 | electric hammer drill | 3        | 0                  | 0                   |
| [microwaves]              | [microwaves]              | 100006 | microwaves            | 1        | 0                  | 0                   |
| [plywoods]                | [plywoods]                | 119996 | plywoods              | 1        | 0                  | 0                   |
| [coca, cola]              | [coca, cola]              | 120276 | coca cola             | 2        | 0                  | 0                   |
| [greenhouses]             | [greenhouses]             | 120318 | greenhouses           | 1        | 0                  | 0                   |
| [pipe, cutters]           | [pipe, cutters]           | 119840 | pipe cutters          | 2        | 0                  | 0                   |
| [buit, themostat, in]     | [buit, themostat, in]     | 206359 | buit in themostat     | 3        | 0                  | 0                   |
| [mowers, ridding]         | [mowers, ridding]         | 120366 | ridding mowers        | 2        | 0                  | 0                   |
+---------------------------+---------------------------+--------+-----------------------+----------+--------------------+---------------------+
[19125 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [13]:
print "No terms used in description : " + str(len(differences[differences['ws count used desc'] == 0]))
print "No terms used in title : " + str(len(differences[differences['ws count used title'] == 0]))
print "No terms used in description and title : " + str(len(differences[(differences['ws count used desc'] == 0) & 
                                                                        (differences['ws count used title'] == 0)]))


No terms used in description : 2666
No terms used in title : 2152
No terms used in description and title : 1206
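The same overlap computation can be sketched without GraphLab: tokenize both strings, take the set difference, and count how many query tokens survive. A hypothetical mini-example (not the real data pipeline):

```python
def unused_terms(search_term, document):
    # Query tokens that never appear verbatim in the document.
    ws = set(search_term.lower().split())
    doc = set(document.lower().split())
    return ws - doc

# Without stemming, 'angle' does not match 'angles', so neither query token is found.
diff = unused_terms("angle bracket", "angles make joints stronger")
used = 2 - len(diff)  # query tokens actually present in the document
```

This is exactly why stemming (below) helps: it collapses 'angles' and 'angle' to the same token before the comparison.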

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline

Stemming


In [29]:
#stemmer = SnowballStemmer("english")
stemmer = PorterStemmer()
def stem(text):
    # Decode to unicode (replacing undecodable bytes), then stem each whitespace token.
    stemmed_tokens = [stemmer.stem(token) for token in unicode(text, errors='replace').split()]
    return ' '.join(stemmed_tokens)
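A Python 3 rendition of the same helper, with a toy suffix-stripper standing in for NLTK's PorterStemmer (only Porter's step 1a; the real algorithm has many more rules):

```python
def toy_stem(token):
    # Crude stand-in for a real stemmer: Porter's plural rules only.
    if token.endswith("sses"):
        return token[:-2]   # "glasses" -> "glass"
    if token.endswith("ies"):
        return token[:-2]   # "ponies"  -> "poni"
    if token.endswith("ss"):
        return token        # "caress" unchanged
    if token.endswith("s"):
        return token[:-1]   # "angles"  -> "angle"
    return token

def stem_text(text):
    # Lowercase, split on whitespace, stem each token, rejoin.
    return " ".join(toy_stem(t) for t in text.lower().split())
```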

In [30]:
print "Starting stemming train search term..."
stemmed = train['search_term'].apply(stem)
train['stem_search_term'] = stemmed

print "Starting stemming train product description..."
stemmed = train['product_description'].apply(stem)
train['stem_product_description'] = stemmed

print "Starting stemming train product title..."
stemmed = train['product_title'].apply(stem)
train['stem_product_title'] = stemmed

print "Starting stemming test search term..."
stemmed = test['search_term'].apply(stem)
test['stem_search_term'] = stemmed

print "Starting stemming test product description..."
stemmed = test['product_description'].apply(stem)
test['stem_product_description'] = stemmed

print "Starting stemming test product title..."
stemmed = test['product_title'].apply(stem)
test['stem_product_title'] = stemmed


Starting stemming train search term...
Starting stemming train product description...
Starting stemming train product title...
Starting stemming test search term...
Starting stemming test product description...
Starting stemming test product title...

TF-IDF with linear regression


In [32]:
train['search_term_word_count'] = gl.text_analytics.count_words(train['stem_search_term'])
train_search_tfidf = gl.text_analytics.tf_idf(train['search_term_word_count'])

In [33]:
train['search_tfidf'] = train_search_tfidf

In [34]:
train['product_desc_word_count'] = gl.text_analytics.count_words(train['stem_product_description'])
train_desc_tfidf = gl.text_analytics.tf_idf(train['product_desc_word_count'])

In [35]:
train['desc_tfidf'] = train_desc_tfidf

In [36]:
train['product_title_word_count'] = gl.text_analytics.count_words(train['stem_product_title'])
train_title_tfidf = gl.text_analytics.tf_idf(train['product_title_word_count'])
train['title_tfidf'] = train_title_tfidf

In [48]:
train['distance_desc'] = train.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['desc_tfidf']))
#train['distance_desc_sqrt'] = train['distance_desc'] ** 2
train['distance_title'] = train.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['title_tfidf']))
#train['distance_title_sqrt'] = train['distance_title'] ** 3
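The TF-IDF weighting and cosine distance used above can be sketched in plain Python. Assumed formulas (GraphLab's exact variant may differ): tf = raw count, idf = log(N / df), and cosine distance = 1 − cosine similarity:

```python
import math

# Toy word-count dictionaries: a query and two "documents".
docs = [
    {"angle": 1, "bracket": 1},   # query
    {"angle": 2, "joint": 1},     # matching description
    {"deck": 1, "over": 1},       # unrelated description
]

# Document frequency of each word across the corpus.
N = len(docs)
df = {}
for d in docs:
    for w in d:
        df[w] = df.get(w, 0) + 1

def tf_idf(counts):
    # Weight each word by count * log(N / document frequency).
    return {w: c * math.log(N / df[w]) for w, c in counts.items()}

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

q = tf_idf(docs[0])
d_match = cosine_distance(q, tf_idf(docs[1]))      # shares 'angle': distance < 1
d_unrelated = cosine_distance(q, tf_idf(docs[2]))  # no shared words: distance = 1
```

Smaller distance means the query and document share more (rarer) vocabulary, which is the signal fed to the regression below.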

In [50]:
model1 = gl.linear_regression.create(train, target = 'relevance', 
                                         features = ['distance_desc', 'distance_title'], 
                                         validation_set = None)
# model1 = gl.linear_regression.create(train, target = 'relevance', 
#                                         features = ['distance_desc', 'distance_desc_sqrt', 'distance_title', 'distance_title_sqrt'], 
#                                         validation_set = None)


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 74067
PROGRESS: Number of features          : 2
PROGRESS: Number of unpacked features : 2
PROGRESS: Number of coefficients    : 3
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.054827     | 1.934252           | 0.502806      |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

In [51]:
#let's take a look at the weights before we plot
model1.get("coefficients")


Out[51]:
+----------------+-------+-----------------+-----------------+
|      name      | index |      value      |      stderr     |
+----------------+-------+-----------------+-----------------+
| (intercept)    |  None |  3.36098120617  | 0.0130126162735 |
| distance_desc  |  None | -0.471115683671 | 0.0175727545026 |
| distance_title |  None | -0.792854603251 | 0.0120369738892 |
+----------------+-------+-----------------+-----------------+
[3 rows x 4 columns]
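Both weights are negative: the larger the cosine distance (i.e. the less the query resembles the description or title), the lower the predicted relevance. A prediction is just the linear form intercept + w_desc·distance_desc + w_title·distance_title, sketched here with the fitted coefficients above:

```python
# Coefficients copied from the fitted model above.
INTERCEPT = 3.36098120617
W_DESC = -0.471115683671
W_TITLE = -0.792854603251

def predict(distance_desc, distance_title):
    return INTERCEPT + W_DESC * distance_desc + W_TITLE * distance_title

close = predict(0.2, 0.1)  # similar to both description and title -> higher relevance
far = predict(1.0, 1.0)    # orthogonal to both (no shared terms) -> lower relevance
```

Note that `far` reproduces the value 2.0970... that recurs in the predictions below: those are queries sharing no stemmed terms with either field.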

In [53]:
test['search_term_word_count'] = gl.text_analytics.count_words(test['stem_search_term'])
test_search_tfidf = gl.text_analytics.tf_idf(test['search_term_word_count'])
test['search_tfidf'] = test_search_tfidf
test['product_desc_word_count'] = gl.text_analytics.count_words(test['stem_product_description'])
test_desc_tfidf = gl.text_analytics.tf_idf(test['product_desc_word_count'])
test['desc_tfidf'] = test_desc_tfidf
test['product_title_word_count'] = gl.text_analytics.count_words(test['stem_product_title'])
test_title_tfidf = gl.text_analytics.tf_idf(test['product_title_word_count'])
test['title_tfidf'] = test_title_tfidf

test['distance_desc'] = test.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['desc_tfidf']))
#test['distance_desc_sqrt'] = test['distance_desc'] ** 2
test['distance_title'] = test.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['title_tfidf']))
#test['distance_title_sqrt'] = test['distance_title'] ** 3

In [54]:
# Disabled: the test set has no 'relevance' column, so RSS can't be computed on it.
'''
predictions_test = model1.predict(test)
test_errors = predictions_test - test['relevance']
RSS_test = sum(test_errors * test_errors)
print RSS_test
'''


Out[54]:
"\npredictions_test = model1.predict(test)\ntest_errors = predictions_test - test['relevance']\nRSS_test = sum(test_errors * test_errors)\nprint RSS_test\n"

In [55]:
predictions_test = model1.predict(test)
predictions_test


Out[55]:
dtype: float
Rows: 166693
[2.1194586905641986, 2.097010919250315, 2.327318656459769, 2.3423416569379105, 2.291363750454904, 2.1292410028129387, 2.3700891659576886, 2.380158719246286, 2.1461522961232458, 2.6725514683935634, 2.4735444612741126, 2.3577916980297187, 2.527370985769263, 2.5464611241149497, 2.2433784781304618, 2.3610473142083634, 2.097010919250315, 2.615135319749975, 2.1428328384749085, 2.1832873644759365, 2.7538729574608336, 2.7335476206465064, 2.1493438560051157, 2.3267645430460764, 2.2512091489393167, 2.503199755290125, 2.097010919250315, 2.2981432458210644, 2.3803635622873522, 2.322596343031413, 2.519767096157715, 2.362660486577712, 2.1974497974380167, 2.309948689847278, 2.313598821940017, 2.341687147636327, 2.4205333515242975, 2.3366063448390766, 2.8853671419333744, 2.8633757709368384, 2.2552865952704444, 2.297532949563152, 2.165997301067405, 2.097010919250315, 2.4552670270468595, 2.3625494876731397, 2.5106462498135387, 2.6188007396757573, 2.61900376832135, 2.2454680169370245, 2.1036340833149754, 2.102527092843549, 2.122092869812211, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.103358463836085, 2.097010919250315, 2.097010919250315, 2.1267995136846127, 2.1269757879601405, 2.097010919250315, 2.4744936795651946, 2.4089294758879447, 2.360521564894908, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.224736363810293, 2.097010919250315, 2.557408344984125, 2.243072745659523, 2.097010919250315, 2.097010919250315, 2.5501178083366147, 2.097010919250315, 2.480335363483051, 2.283461192815867, 2.283461192815867, 2.1872823298107584, 2.1683105172043, 2.3396260222904397, 2.387438457989694, 2.505582843823445, 2.418305094666935, 2.5793821608777128, 2.3349132621118036, 2.097010919250315, 2.5821088945080928, 2.3188142743790774, 2.270976568556923, 2.242832290300672, 2.4621407613916673, 2.338149998696164, 2.3557175413828784, ... ]

In [56]:
submission = gl.SFrame(test['id'])

In [57]:
submission.add_column(predictions_test)
submission.rename({'X1': 'id', 'X2':'relevance'})


Out[57]:
+----+---------------+
| id |   relevance   |
+----+---------------+
| 1  | 2.11945869056 |
| 4  | 2.09701091925 |
| 5  | 2.32731865646 |
| 6  | 2.34234165694 |
| 7  | 2.29136375045 |
| 8  | 2.12924100281 |
| 10 | 2.37008916596 |
| 11 | 2.38015871925 |
| 12 | 2.14615229612 |
| 13 | 2.67255146839 |
+----+---------------+
[166693 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [58]:
submission['relevance'] = submission.apply(lambda x: 3.0 if x['relevance'] > 3.0 else x['relevance'])
submission['relevance'] = submission.apply(lambda x: 1.0 if x['relevance'] < 1.0 else x['relevance'])
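The two passes above clamp predictions into the label range [1.0, 3.0], since the crowdsourced relevance scores can never fall outside it. The same clamp in one step:

```python
def clip_relevance(r, low=1.0, high=3.0):
    # Scores outside the label range are clamped to its endpoints.
    return max(low, min(high, r))
```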

In [59]:
submission['relevance'] = submission.apply(lambda x: str(x['relevance']))

In [60]:
submission.export_csv('../data/submission2.csv', quote_level = 3)

In [ ]:
#gl.canvas.set_target('ipynb')